In [ ]:

    
%%HTML
<style>
.container { width:100% }
</style>

Linear Regression with SciKit-Learn

In this notebook we need both sklearn and pandas. These can be installed using the following commands:

conda install scikit-learn
conda install pandas

We import the module pandas. This module implements so-called data frames and is more convenient than the module csv for reading a CSV file.



In [ ]:

    
import pandas as pd

The data we want to read is contained in the csv file 'cars.csv'.



In [ ]:

    
cars = pd.read_csv('cars.csv')
cars.head()

We want to convert the column containing mpg into a NumPy array, while the remaining numerical attributes are collected into a feature matrix.



In [ ]:

    
import numpy as np



In [ ]:

    
X = np.array(cars[['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']])
Y = np.array(cars['mpg'])
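
The split into a feature matrix and a target vector can be tried out without the file 'cars.csv'. The following is a minimal sketch: the column names are those used above, but the three rows are made-up values chosen only for illustration, and the names `csv_text`, `toy`, `X_toy`, and `Y_toy` are hypothetical. It uses `to_numpy()`, which is an alternative to wrapping the selection in `np.array`.

```python
import io
import pandas as pd

# Made-up rows for illustration only; these are NOT taken from cars.csv.
csv_text = """mpg,cyl,displacement,hp,weight,acc,year
20.0,6,200.0,100.0,3000.0,15.0,75
30.0,4,100.0,70.0,2000.0,17.0,80
15.0,8,350.0,150.0,4000.0,12.0,72
"""

# read_csv accepts a file-like object as well as a file name.
toy = pd.read_csv(io.StringIO(csv_text))

features = ['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']
X_toy = toy[features].to_numpy()  # feature matrix: one row per car, one column per attribute
Y_toy = toy['mpg'].to_numpy()     # target vector: one entry per car

print(X_toy.shape, Y_toy.shape)
```

The feature matrix has one row per car and one column per attribute, while the target is a one-dimensional array of the same length.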

Let us inspect the first five rows of the matrix X.



In [ ]:

    
X[:5]

Miles per gallon is the reciprocal of the fuel consumption. Since we want to model the fuel consumption, we replace Y by its reciprocal.



In [ ]:

    
Y = 1 / Y
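
To make the reciprocal relation concrete, here is a small sketch with made-up mpg values. It assumes that mpg refers to US gallons; the names `mpg_demo`, `gallons_per_mile`, and `litres_per_100km` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np

# Made-up mpg values, chosen only for illustration.
mpg_demo = np.array([10.0, 20.0, 40.0])

# The reciprocal of miles per gallon is gallons per mile, i.e. a fuel consumption.
gallons_per_mile = 1 / mpg_demo

# Conversion to litres per 100 km, assuming US gallons:
# 1 US gallon = 3.785411784 litres, 1 mile = 1.609344 km.
LITRES_PER_GALLON = 3.785411784
KM_PER_MILE = 1.609344
litres_per_100km = gallons_per_mile * LITRES_PER_GALLON / KM_PER_MILE * 100

print(gallons_per_mile)
print(litres_per_100km)
```

Doubling the mpg value halves the fuel consumption, so the two quantities are not linearly related to each other.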

We import the module linear_model from SciKit-Learn:



In [ ]:

    
import sklearn.linear_model as lm

We create a linear model.



In [ ]:

    
M = lm.LinearRegression()

We train this model using the data we have.



In [ ]:

    
M.fit(X, Y)

The model M represents a linear relationship between the dependent variable $1/\texttt{mpg}$ and the independent variables $\texttt{cyl}$, $\texttt{displacement}$, $\texttt{hp}$, $\texttt{weight}$, $\texttt{acc}$, and $\texttt{year}$ of the form
$$\displaystyle \frac{1}{\texttt{mpg}} = \vartheta_0 + \vartheta_1 \cdot \texttt{cyl}
          + \vartheta_2 \cdot \texttt{displacement}
          + \vartheta_3 \cdot \texttt{hp}
          + \vartheta_4 \cdot \texttt{weight}
          + \vartheta_5 \cdot \texttt{acc}
          + \vartheta_6 \cdot \texttt{year}
$$
We proceed to extract the coefficients $\vartheta_i$ for $i\in\{0,\cdots,6\}$.



In [ ]:

    
ϑ0 = M.intercept_
ϑ0



In [ ]:

    
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6 = M.coef_
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6
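
The intercept and the coefficients are all that is needed to evaluate the linear formula by hand. The following sketch checks this on synthetic data, so it does not depend on 'cars.csv'; the names `X_demo`, `Y_demo`, `true_coef`, and `M_demo` are hypothetical, and the numbers are arbitrary.

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data with six features that follows an exact linear law (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
true_coef = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5])
Y_demo = 3.0 + X_demo @ true_coef

M_demo = lm.LinearRegression().fit(X_demo, Y_demo)

# Evaluate the linear formula by hand: intercept plus the weighted sum of the features.
manual = M_demo.intercept_ + X_demo @ M_demo.coef_

# The manual evaluation agrees with the predictions of the model.
print(np.allclose(manual, M_demo.predict(X_demo)))
```

Because the synthetic data is noise free, the fitted intercept and coefficients also reproduce the values used to generate it, up to rounding errors.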

Let us check how much of the variance is explained by our model.



In [ ]:

    
R2 = M.score(X, Y)
R2

The linear model explains $88\%$ of the variance of the fuel consumption $1/\texttt{mpg}$. In order to derive a better model, we would need both the reference area of the car and its drag coefficient.
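
For a regression model, the method `score` returns the coefficient of determination $R^2 = 1 - \mathtt{RSS}/\mathtt{TSS}$, where $\mathtt{RSS}$ is the sum of the squared residuals and $\mathtt{TSS}$ is the sum of the squared deviations of the target from its mean. The sketch below recomputes this value by hand on synthetic data; the names `X_demo`, `Y_demo`, `M_demo`, and `r2_manual` are hypothetical, and the data is random rather than taken from 'cars.csv'.

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data: a linear law in two features plus Gaussian noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
Y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

M_demo = lm.LinearRegression().fit(X_demo, Y_demo)

# Residual sum of squares: the variation that the model does not explain.
residual_ss = np.sum((Y_demo - M_demo.predict(X_demo)) ** 2)
# Total sum of squares: the variation of the target around its mean.
total_ss = np.sum((Y_demo - Y_demo.mean()) ** 2)

r2_manual = 1 - residual_ss / total_ss
print(r2_manual, M_demo.score(X_demo, Y_demo))
```

Both printed values agree, which shows that the score is the fraction of the variance of the target that the model accounts for on the data it is given.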


